(54) Linear scalable FFT/IFFT computation in a multi-processor system



(57) This invention relates to a linear scalable method for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a decimation in time approach. Linear scalability means that, as the number of processors increases by a factor P (for example), the number of computational cycles reduces by exactly the same factor P. The invention comprises computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations, fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

Figure 2. Butterfly distribution for 2-processor configuration

Description

Field of the invention 

[0001] The present invention relates to the field of digital signal processing. More particularly, the invention relates to linearly scalable FFT/IFFT computation in a multiprocessor system.

Background of the invention 

[0002] The class of Fourier transforms that applies to signals that are discrete and periodic in nature is known as the Discrete Fourier Transform (DFT). The DFT plays a key role in digital signal processing in areas such as spectral analysis, frequency domain filtering and polyphase transformations.

[0003] The DFT of a sequence of length N can be decomposed into successively smaller DFTs. The manner in which this principle is implemented falls into two classes. The first class is called "decimation in time" and the second "decimation in frequency". The first derives its name from the fact that, in the process of arranging the computation into smaller transformations, the sequence x(n) (the index 'n' is often associated with time) is decomposed into successively smaller subsequences. In the second general class, the sequence of DFT coefficients X(k) is decomposed into smaller subsequences (k denoting frequency). The present invention employs "decimation in time".
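The decimation-in-time idea can be illustrated with a minimal recursive sketch, assuming N is a power of two and the forward-transform twiddle convention $W_N = e^{-j2\pi/N}$ (the text does not fix a sign convention). The function names `dft_direct` and `fft_dit` are illustrative only and do not come from the specification.

```python
import cmath


def dft_direct(x):
    """Direct O(N^2) DFT, used here only as a reference for comparison."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_dit(x[0::2])  # DFT of the even-indexed time samples
    odd = fft_dit(x[1::2])   # DFT of the odd-indexed time samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

The time-domain sequence is split into smaller subsequences at every level of recursion, which is what gives the approach its name.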

[0004] Since the amount of storing and processing of data in numerical computation algorithms is proportional to the number of arithmetic operations, it is generally accepted that a meaningful measure of complexity, or of the time required to implement a computational algorithm, is the number of multiplications and additions required. The direct computation of the DFT requires $4N^2$ real multiplications and $N(4N-2)$ real additions. Since the amount of computation, and thus the computation time, is approximately proportional to $N^2$, it is evident that the number of arithmetic operations required to compute the DFT by the direct method becomes very large for large values of N. For this reason, computational procedures that reduce the number of multiplications and additions are of considerable interest. The Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT.
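The operation counts quoted above can be reproduced with a short calculation. The sketch below assumes the usual accounting in which one complex multiplication costs 4 real multiplications and 2 real additions, and one complex addition costs 2 real additions; the function name `direct_dft_real_ops` is hypothetical.

```python
def direct_dft_real_ops(n):
    """Real multiplications and additions for a direct N-point DFT."""
    complex_mults = n * n       # one product x(t) * W for every (k, t) pair
    complex_adds = n * (n - 1)  # N - 1 additions for each of the N outputs
    real_mults = 4 * complex_mults
    real_adds = 2 * complex_mults + 2 * complex_adds
    return real_mults, real_adds
```

With these assumptions the totals simplify to $4N^2$ real multiplications and $2N^2 + 2N(N-1) = N(4N-2)$ real additions, matching the figures in the text.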

[0005] The conventional method of implementing an FFT or Inverse Fast Fourier Transform (IFFT) uses a radix-2, radix-4 or mixed-radix approach with either a "decimation in time (DIT)" or a "decimation in frequency (DIF)" approach.

[0006] The basic computational block is called a "butterfly", a name derived from the appearance of the flow of the computations involved in it. Fig. 1 shows a typical radix-2 butterfly computation. 1.1 represents the 2 inputs (referred to as the "odd" and "even" inputs) of the butterfly and 1.2 refers to the 2 outputs. One of the inputs (in this case the odd input) is multiplied by a complex quantity called the twiddle factor ($W_N^k$). The general equations describing the relationship between inputs and outputs are as follows:

$$X[k] = x[n] + x[n+N/2]\,W_N^k$$

$$X[k+N/2] = x[n] - x[n+N/2]\,W_N^k$$
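As a minimal sketch, the two butterfly equations map directly onto a small function. It assumes the forward-transform convention $W_N^k = e^{-j2\pi k/N}$, which the text does not state explicitly; the name `butterfly` and its parameter names are illustrative.

```python
import cmath


def butterfly(even, odd, k, n):
    """Radix-2 butterfly: returns (even + odd*W_N^k, even - odd*W_N^k)."""
    w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k (assumed sign)
    t = odd * w                            # the single complex multiply
    return even + t, even - t              # the two complex adds
```

Only the odd input is multiplied by the twiddle factor, so each butterfly needs one complex multiplication and two complex additions.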

[0007] An FFT butterfly calculation is implemented by a z-point data operation wherein 'z' is referred to as the "radix". An 'N' point FFT employs N/z butterfly units per stage (block) for $\log_z N$ stages. The result of one butterfly stage is applied as an input to one or more subsequent butterfly stages.
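The relationship between transform length, radix, butterflies per stage and stage count can be checked with a small helper. This is an illustrative sketch only; the name `butterfly_layout` is hypothetical and N is assumed to be an exact power of the radix.

```python
def butterfly_layout(n, radix):
    """Return (butterflies per stage, number of stages) for an n-point FFT."""
    stages = 0
    m = n
    while m > 1:
        if m % radix != 0:
            raise ValueError("n must be a power of the radix")
        m //= radix
        stages += 1
    return n // radix, stages
```

For example, an 8-point radix-2 transform has N/z = 4 butterflies in each of its 3 stages.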

[0008] The computational complexity of an N-point FFT calculation using the radix-2 approach is $O(N/2 \cdot \log_2 N)$, where N is the length of the transform. There are exactly $N/2 \cdot \log_2 N$ butterfly computations, each comprising 3 complex loads, 1 complex multiply, 2 complex adds and 2 complex stores. A full radix-4 implementation, on the other hand, requires several complex load/store operations. Since only 1 store operation and 1 load operation are allowed per bundle of a typical VLIW processor, cycles are wasted in doing only load/store operations, thus reducing ILP (Instruction Level Parallelism). The conventional nested loop approach incurs a high looping overhead on the processor. It also makes application of standard optimization methods difficult. Due to the nature of the data dependencies of the conventional FFT/IFFT implementations, multi-cluster processor configurations do not provide much benefit in terms of computational cycles.
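For reference, the conventional nested-loop structure mentioned above can be sketched as follows. This is a generic textbook-style in-place radix-2 DIT FFT, not code from the specification; it assumes N is a power of two, bit-reversed input ordering and the twiddle convention $W_N = e^{-j2\pi/N}$. The names `bit_reverse` and `fft_nested` are hypothetical. The function also counts butterflies so the $N/2 \cdot \log_2 N$ figure can be checked.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_nested(x):
    """Conventional radix-2 DIT FFT with three nested loops."""
    n = len(x)
    bits = n.bit_length() - 1
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    count = 0
    half = 1
    while half < n:                              # loop 1: stages
        for start in range(0, n, 2 * half):      # loop 2: groups in a stage
            for j in range(half):                # loop 3: butterflies in a group
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))
                t = a[start + j + half] * w
                a[start + j + half] = a[start + j] - t
                a[start + j] = a[start + j] + t
                count += 1
        half *= 2
    return a, count
```

The loop bounds of the two inner loops change from stage to stage, which is the looping overhead the paragraph refers to.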

[0009] While the complex calculations are reduced in number, the time taken on a normal processor can still be quite large. It is therefore necessary in many applications requiring high-speed or real-time response to resort to multiprocessing in order to reduce the overall computation time. For efficient operation, it is desirable to have the computation linearly scalable, in other words, to have the computation time reduce in inverse proportion to the number of processors in the multiprocessing solution. Current multiprocessing implementations of FFT/IFFT, however, do not provide such linear scalability.

[0010] US patent 6,366,936 describes a multiprocessor approach for efficient FFT. The approach defined is a pipelined process wherein each processor is dependent on the output of the preceding processor in order to perform its share of work. The increase in throughput is not linear in the number of processors employed in the operation.

[0011] US patent 5,293,330 describes a pipelined processor for mixed size FFT. Here too, the approach does not provide linear scalability in throughput, as it is pipelined.

[0012] A scheme for parallel FFT/IFFT as described in "Parallel 1-D FFT Implementation with TMS320C4x DSPs" by the Semiconductor Group, Texas Instruments, uses butterflies that are distributed between two processors. In this implementation, inter-processor communication is required because subsequent computations on one processor depend on intermediate results from other processors. Every processor computes a butterfly operation on each of the butterfly pairs allocated to it, then sends half of its computed result to the processor that needs it for the next computation step, and then waits for information of the same length from another node to arrive before continuing computation. This interdependence of processors for a single butterfly computation does not support a linear increase in output with an increase in the number of processors.

Summary of the invention: 

[0013] The object of the present invention is to overcome the above drawbacks and provide linear scalability of throughput in a multiprocessor system.

[0014] To achieve the aforementioned objective, the present invention provides a modified arrangement for enabling the parallel computation of different butterflies in different processors.

[0015] The invention provides a linear scalable method for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a Decimation in Time approach, comprising the steps of:

- computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations,
- fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and
- distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

[0016] The said distribution of butterfly computation is implemented by assigning the memory location addresses corresponding to the inputs and outputs required for each specific butterfly calculation to a selected processor.

[0017] The instant invention also provides a linear scalable system for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a Decimation in Time approach, comprising:

- means for computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations,
- means for fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and
- means for distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

[0018] The said means for distributing the computation of the butterflies is implemented by means for assigning the memory location addresses corresponding to the inputs and outputs required for specific butterfly calculations to the selected processor.

[0019] Further, the invention provides a computer program product comprising computer readable program code stored on a computer readable storage medium embodied therein for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a Decimation in Time approach, comprising:

- computer readable program code means configured for computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations,
- computer readable program code means configured for fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and
- computer readable program code means configured for distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

[0020] The said computer readable program code means configured for distributing the computation of the butterflies is implemented by computer readable program code means configured for assigning the memory location addresses corresponding to the inputs and outputs required for specific butterfly calculations to a selected processor.

Brief Description of the Drawings: 

[0021] The present invention will now be explained with reference to the accompanying drawings, which are given only by way of illustration and are not limiting for the present invention.

Fig 1 shows the basic structure of the signal flow in a radix-2 butterfly computation for a discrete Fourier transform.

Fig 2 shows a 2-processor implementation of butterflies for an 8-point FFT, in accordance with the present invention.

Fig 3 shows a 4-processor implementation of butterflies for an 8-point FFT, in accordance with the present invention.

Detailed Description of the Drawings: 

[0022] Fig. 1 has already been described in the background to the invention.

[0023] Fig 2 shows the implementation of an 8-point FFT in a 2-processor architecture using the present invention. Dotted lines are computed in one processor, and dashed lines in the other. The computational blocks are represented by '0'. The left side of each computational block is its input (the time domain samples) while the right side is its output (transformed samples). The present invention uses a mixed radix approach with decimation in time. The first two stages of the radix-2 FFT/IFFT are computed as a single radix-4 stage. As these stages contain only load/store and add/subtract operations, there is no need for multiplication. This leads to reduced time for FFT/IFFT computation as compared to that with a full radix-2 implementation. The next stage has been implemented as a radix-2 stage. The three main nested loops of conventional implementations have been fused into a single loop which iterates $(N/2 \cdot (\log_2 N - 2))/(\text{number of processors})$ times. Each processor is used to compute one butterfly in one loop iteration. Since there is no data dependency between different butterflies of a stage in this algorithm, the computational load can be linearly divided among the different processors, leading to linear scalability.
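One possible reading of this scheme is sketched below. It is not the patented implementation: the processors are simulated sequentially, only the fused radix-2 loop is distributed, the input is assumed to be taken in bit-reversed order, and the twiddle convention $W_N = e^{-j2\pi/N}$ is assumed. The names `fft_fused`, `num_procs` and `work` are hypothetical. In each loop iteration all simulated processors read their inputs before any result is written back, which only works because the butterflies they handle touch disjoint memory locations.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_fused(x, num_procs=2):
    """Mixed-radix DIT FFT: one radix-4 pass, then a single fused radix-2 loop."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n >= 4 and n == 1 << stages, "N must be a power of two, N >= 4"
    assert (n // 2) % num_procs == 0, "processor count must divide N/2"
    a = [complex(x[bit_reverse(i, stages)]) for i in range(n)]

    # Stages 1 and 2 as one radix-4 butterfly per group of four samples.
    # The only twiddle factors are 1 and -j, so no real multiplication is needed.
    for g in range(0, n, 4):
        b0, b1 = a[g] + a[g + 1], a[g] - a[g + 1]
        b2, b3 = a[g + 2] + a[g + 3], a[g + 2] - a[g + 3]
        rot = complex(b3.imag, -b3.real)  # equals -j * b3
        a[g], a[g + 2] = b0 + b2, b0 - b2
        a[g + 1], a[g + 3] = b1 + rot, b1 - rot

    # Remaining (log2 N - 2) stages fused into a single loop.
    total = (n // 2) * (stages - 2)
    work = [0] * num_procs  # butterflies computed by each simulated processor
    for t in range(total // num_procs):
        results = []
        for p in range(num_procs):  # one butterfly per processor per iteration
            stage, b = divmod(t * num_procs + p, n // 2)
            half = 4 << stage       # butterfly span in this stage
            top = (b // half) * 2 * half + (b % half)
            bot = top + half
            w = cmath.exp(-2j * cmath.pi * (b % half) / (2 * half))
            tw = a[bot] * w
            results.append((top, a[top] + tw, bot, a[top] - tw))
            work[p] += 1
        for top, u, bot, v in results:  # write back after all reads
            a[top], a[bot] = u, v
    return a, work
```

For a 16-point transform on 4 simulated processors the fused loop runs $(8 \cdot 2)/4 = 4$ times and each processor computes 4 butterflies. A real multiprocessor version would additionally need the processors to be synchronized between stages, since a later stage consumes the outputs of the earlier one.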

[0024] The mechanism for assigning the butterflies in this manner consists of assigning memory locations to a processor such that each processor computes a complete butterfly. To achieve this, a binary digit is inserted at the appropriate bit location in the address of the memory location for input/output data for the computation of the butterfly, depending on the stage of the FFT transformation.
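The text does not spell out the bit position, so the following is only one plausible interpretation: for butterfly number b in a stage whose butterfly span is $2^{s-1}$ (stages counted from 1), inserting a 0 at bit position s-1 of b gives the address of the upper input/output and inserting a 1 gives the lower one. The names `insert_bit` and `butterfly_addresses` are hypothetical.

```python
def insert_bit(index, pos, bit):
    """Insert `bit` at bit position `pos` of `index`, shifting higher bits up."""
    low = index & ((1 << pos) - 1)  # bits below the insertion point
    high = index >> pos             # bits at and above the insertion point
    return (high << (pos + 1)) | (bit << pos) | low


def butterfly_addresses(b, stage):
    """Addresses of the two data points of butterfly b in a radix-2 stage."""
    pos = stage - 1  # assumed: the two points are 2**(stage - 1) apart
    return insert_bit(b, pos, 0), insert_bit(b, pos, 1)
```

Under this assumption, the third stage of an 8-point transform pairs addresses (0, 4), (1, 5), (2, 6) and (3, 7), so each butterfly number maps to a unique pair and a processor that is handed a butterfly number owns both of its memory locations.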

[0025] Fig 3 shows a 4-processor implementation for the 8-point FFT using this invention. Different line styles represent computation in each of the 4 processors.

[0026] In the present implementation of the invention, each processor comprises one or more ALUs (Arithmetic Logic Units), multiplier units, a data cache, and load/store units. All processors share a common instruction cache and multi-port memory. Very Long Instruction Word (VLIW) processors are representative of such architectures and can be used for meeting these requirements.

[0027] It will be apparent to those with ordinary skill in the art that the foregoing is merely illustrative and not intended to be exhaustive or limiting, having been presented by way of example only, and that various modifications can be made within the scope of the above invention.



[0028] Accordingly, this invention is not to be considered limited to the specific examples chosen for purposes of disclosure, but rather to cover all changes and modifications which do not constitute departures from the permissible scope of the present invention. The invention is therefore not limited by the description contained herein or by the drawings, but only by the claims.

Claims

1. A linear scalable method for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a decimation in time approach, comprising the steps of:

- computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations,
- fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and
- distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

2. A linear scalable method as claimed in claim 1 wherein said distribution of butterfly computation is implemented by assigning the memory location addresses corresponding to the inputs and outputs required for each specific butterfly calculation to a selected processor.

3. A linear scalable system for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a decimation in time approach, comprising:

- means for computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations,
- means for fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and
- means for distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

4. A linear scalable system as claimed in claim 3 wherein said means for distributing the computation of the butterflies is implemented by means for assigning the memory location addresses corresponding to the inputs and outputs required for specific butterfly calculations to the selected processor.

5. A computer program product comprising computer readable program code stored on a computer readable storage medium embodied therein for computing a Fast Fourier Transform (FFT) or Inverse Fast Fourier Transform (IFFT) in a multiprocessing system using a decimation in time approach, comprising:

- computer readable program code means configured for computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation while implementing the remaining ($\log_2 N - 2$) stages as radix-2 operations,
- computer readable program code means configured for fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly computation loop, and
- computer readable program code means configured for distributing the computation of the butterflies in each stage such that each processor computes an equal number of complete butterfly calculations, thereby eliminating data interdependency in the stage.

6. The computer program product as claimed in claim 5 wherein said computer readable program code means configured for distributing the computation of the butterflies is implemented by computer readable program code means configured for assigning the memory location addresses corresponding to the inputs and outputs required for specific butterfly calculations to a selected processor.

Figure 1. Butterfly, the basic computational block of FFT/IFFT
Figure 2. Butterfly distribution for 2-processor configuration
Figure 3. Butterfly distribution for 4-processor configuration